Abstract

We present a theoretical study of a problem arising in database query optimization [1],
which we call the Common Prefix Problem. We present a $(1 - o(1))$ factor approximation
algorithm for this problem when the underlying graph is a binary tree. We then
use a result of Feige and Kogan [2] to show that even on stars, the problem is hard to
approximate.

1 Problem 

Let $T$ be a tree with $V$ as its vertex set and $E$ as its edge set. Let each vertex $v$ be associated
with a set of labels $S_v$, taken from an alphabet $\Sigma$. Suppose that the vertices $v$ and $u$ are
adjacent and that their label sets are arranged as permutations $P_v$ and $P_u$. We define the
benefit of the edge $uv$ as the length $|P_v \wedge P_u|$ of the longest common prefix $P_v \wedge P_u$. The goal
is to maximize the total benefit by permuting the labels associated with each vertex appropriately.
More precisely, find permutations $P_1, P_2, \ldots, P_{|V|}$ so as to maximize $\sum_{uv \in E} |P_u \wedge P_v|$.
The corresponding decision problem is known to be NP-complete [1]. It can be solved in
polynomial time if the tree is a path, and a 1/2-factor approximation is known for the case of
a binary tree [1]. In this paper we give a $(1 - o(1))$ factor algorithm for this problem on binary
trees. We then study the problem when the underlying graph is a star ($K_{1,r}$) and prove a
hardness of approximation result by relating this problem to the Maximum Edge Biclique
problem. Throughout the paper we assume that the size of the alphabet $\Sigma$ is a constant.
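
The objective can be made concrete with a small evaluator. The following is a minimal sketch: the function names and the three-vertex path instance are hypothetical and chosen only for illustration. It computes the total benefit of a given assignment of permutations.

```python
from typing import Dict, List, Sequence, Tuple


def common_prefix_len(p: Sequence[str], q: Sequence[str]) -> int:
    """Length of the longest common prefix of two label sequences."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


def total_benefit(edges: List[Tuple[str, str]],
                  perm: Dict[str, Sequence[str]]) -> int:
    """Sum of |P_u ^ P_v| over all edges uv of the tree."""
    return sum(common_prefix_len(perm[u], perm[v]) for u, v in edges)


# Hypothetical path a - b - c with label sets {x, y}, {x, y, z}, {y, z}.
edges = [("a", "b"), ("b", "c")]
perm = {"a": ["y", "x"], "b": ["y", "x", "z"], "c": ["y", "z"]}
benefit = total_benefit(edges, perm)  # edge ab contributes 2, edge bc contributes 1
```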



2 Optimal Recursion For Trees 

In this section we give a recursion that solves Common Prefix optimally on trees. This recursion
may run in exponential time. In the next section we will run it on sufficiently small trees
to get the $(1 - o(1))$ factor algorithm.

We observe that the labels common to all vertices can always be placed as a prefix of
the permutation at every vertex. If the first label in the permutation associated
with each vertex is the same, then that label is common to all vertices. Hence, once the
common labels are removed, some edge has zero benefit in the optimal solution. We
can delete this edge from the tree $T$ and recurse as follows.
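
One way to write this recursion, consistent with the observation above but stated here with assumed notation, is the following. Here $C(T) = \bigcap_{v \in V(T)} S_v$ is the set of labels common to all vertices of $T$, $OPT_{CP}(T)$ is the optimal total benefit on $T$ with the original label sets, and a single-vertex tree has value 0. The deleted edge $e$ carries exactly the $|C(T)|$ common labels, and the two sides are then solved independently:

```latex
\[
OPT_{CP}(T) \;=\; |C(T)| \;+\; \max_{e \in E(T)}
  \Bigl( OPT_{CP}(T_1) + OPT_{CP}(T_2) \Bigr)
\]
```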



where $T_1$ and $T_2$ are the two connected components of $T \setminus e$. However, solving this recursion
may take a number of steps exponential in the number of nodes, for example on a complete binary tree
of size $n$. The recursion (1) can be implemented as a dynamic program for trees that have
a polynomially bounded number of subtrees, for example paths. We show that binary trees
of height $\log\log n$ also have this property.
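
The dynamic program over subtrees can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: it memoizes on the vertex set of each subtree, uses the form "common labels on the deleted edge plus the two sides", and compares the result with exhaustive search on two tiny hypothetical instances.

```python
from functools import lru_cache
from itertools import permutations, product


def opt_recursive(edges, labels):
    """edges: list of (u, v) pairs of a tree; labels: dict vertex -> set of labels."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def component(start, allowed, banned_edge):
        # Vertices reachable from `start` inside `allowed` without crossing banned_edge.
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in allowed and y not in seen and {x, y} != banned_edge:
                    seen.add(y)
                    stack.append(y)
        return frozenset(seen)

    @lru_cache(maxsize=None)
    def opt(vs):
        if len(vs) == 1:
            return 0
        common = len(set.intersection(*(set(labels[v]) for v in vs)))
        best = 0
        for u, v in edges:
            if u in vs and v in vs:
                t1 = component(u, vs, {u, v})
                best = max(best, opt(t1) + opt(vs - t1))
        return common + best

    return opt(frozenset(labels))


def opt_brute_force(edges, labels):
    """Try every combination of permutations (only feasible for tiny instances)."""
    vs = list(labels)
    best = 0
    for choice in product(*(list(permutations(sorted(labels[v]))) for v in vs)):
        perm = dict(zip(vs, choice))
        total = 0
        for u, v in edges:
            for a, b in zip(perm[u], perm[v]):
                if a != b:
                    break
                total += 1
        best = max(best, total)
    return best


# Hypothetical instances: a path and a star.
path_edges = [("a", "b"), ("b", "c")]
path_labels = {"a": {"x", "y"}, "b": {"x", "y", "z"}, "c": {"y", "z"}}
star_edges = [("r", "a"), ("r", "b"), ("r", "c")]
star_labels = {"r": {"1", "2", "3"}, "a": {"1", "2"}, "b": {"2", "3"}, "c": {"1", "3"}}
```

Memoizing on the vertex set is what makes the number of distinct subtrees the relevant quantity for the running time.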

Claim: The total number of subtrees in a binary tree of height $\log\log n$ is at most $n^2$.

Proof: The total number of nodes in a binary tree of height $h$ is at most $2^h$. Connecting
each subset of the vertices to the root yields a subtree containing the root, so there are at
most $2^{2^h}$ such subtrees. Thus the total number of subtrees in a binary tree of height $h$ is at
most $2^h \cdot 2^{2^h}$, since each of the at most $2^h$ vertices can be the topmost vertex of a subtree.
If $h$ equals $\log\log n$, this is $n \log n \le n^2$, and we get the desired result. □
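
The counting argument can be checked numerically on complete binary trees. The sketch below assumes that height is measured in levels, so that a tree of height $h$ has $2^h - 1 \le 2^h$ nodes, and compares the exact number of connected subtrees with the product of the at most $2^h$ choices of topmost vertex and the at most $2^{2^h}$ subtrees per choice. The function names are illustrative.

```python
def rooted_subtrees(levels: int) -> int:
    """Connected subtrees that contain the root of a complete binary tree."""
    count = 1  # one level: the root alone
    for _ in range(levels - 1):
        # Each child is either left out or is the root of one of `count` subtrees.
        count = (1 + count) ** 2
    return count


def all_subtrees(levels: int) -> int:
    """All connected subtrees, grouped by their topmost vertex."""
    total = 0
    for depth in range(levels):
        # 2**depth vertices sit at this depth; each tops a tree with levels - depth levels.
        total += (2 ** depth) * rooted_subtrees(levels - depth)
    return total


counts = {h: all_subtrees(h) for h in range(1, 6)}
bounds = {h: (2 ** h) * 2 ** (2 ** h) for h in range(1, 6)}
```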

It follows that the recursion (1) can be solved optimally in time $O(n^2)$ on binary trees of
height $\log\log n$. We use this to give a $(1 - \frac{1}{\log\log n})$ factor approximation for Common Prefix
on binary trees.



3 (1 - o(1)) Factor Algorithm

Consider a binary tree $T$ of height $h$ on $n$ vertices, rooted at vertex $r$. We split $T$ into sets
$A_1, A_2, \ldots, A_{\log\log n}$, each consisting of subtrees of height at most $\log\log n$. $A_1$ consists of
the subtrees obtained by deleting the edges joining vertices at heights $i \log\log n - 1$ and
$i \log\log n$ for $1 \le i \le \lceil \frac{h}{\log\log n} \rceil$. $A_2$ consists of the subtrees obtained by deleting the edges joining
vertices at heights $i \log\log n$ and $i \log\log n + 1$ for $0 \le i \le \lceil \frac{h}{\log\log n} \rceil$, and so on. Each
$A_i$ consists of vertex-disjoint subtrees of height at most $\log\log n$. Since each $A_i$ contains no
more than $n$ subtrees, we can solve Common Prefix on each $A_i$ optimally. We denote the
optimal value for $A_i$ by $OPT_{CP}(A_i)$. Note that each edge occurs in all but one of the $A_i$'s.
Let $b_e$ denote the benefit of the edge $e$ in the optimal solution, and let $A$ denote the maximum of the
$OPT_{CP}(A_i)$'s. Then from the preceding discussion we have,

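\[
\begin{aligned}
(\log\log n - 1)\, OPT_{CP} &= (\log\log n - 1) \sum_{e \in E} b_e \\
&= \sum_{e \in A_1} b_e + \sum_{e \in A_2} b_e + \cdots + \sum_{e \in A_{\log\log n}} b_e \\
&\le OPT_{CP}(A_1) + OPT_{CP}(A_2) + \cdots + OPT_{CP}(A_{\log\log n}) \\
&\le (\log\log n)\, A.
\end{aligned}
\]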

We thus have a $(1 - \frac{1}{\log\log n})$ factor algorithm for binary trees by taking the maximum over
the $A_i$'s. Since a binary tree of height $\log\log n$ has at most $n^2$ subtrees, each $A_i$ can have
at most $\frac{n}{\log\log n}$ trees, and there are $\log\log n$ $A_i$'s, the total time taken by this algorithm
is $O(\frac{n}{\log\log n} \cdot n^2 \cdot \log\log n) = O(n^3)$.
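
The layered decomposition can be sketched in code. The sketch assumes the tree is given as a child-to-parent map and that `block` plays the role of $\log\log n$; both names are hypothetical. For each offset it lists the edges deleted to form the corresponding $A_i$, so that every edge is deleted for exactly one offset and survives in all the others.

```python
from collections import deque


def layered_cuts(parent, block):
    """parent: dict child -> parent, with the root mapped to None.
    Returns `block` lists; list j holds the edges deleted for offset j."""
    children = {}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children.setdefault(p, []).append(v)

    # Breadth-first search to get the depth of every vertex.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for c in children.get(x, []):
            depth[c] = depth[x] + 1
            queue.append(c)

    cuts = [[] for _ in range(block)]
    for v, p in parent.items():
        if p is not None:
            # The edge (p, v) is cut when the child's depth matches the offset mod block.
            cuts[depth[v] % block].append((p, v))
    return cuts


# Hypothetical complete binary tree on 15 vertices in heap order.
parent = {i: (i // 2 if i > 1 else None) for i in range(1, 16)}
cuts = layered_cuts(parent, 2)
```

With offset 0 the deleted edges join depths $i \cdot \text{block} - 1$ and $i \cdot \text{block}$, matching the description of $A_1$; offset 1 matches $A_2$.
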

Note that we can trade the approximation factor for running time as follows. For fixed
$\epsilon < 1$, take $N = \lceil \frac{1}{\epsilon} \rceil$. Now, instead of taking subtrees of height at most $\log\log n$ in the $A_i$'s,
take them to be of height at most $N$. We can use the recursion (1) to solve for the subtrees
of height at most $N$ in time $O(n 2^{2^N})$. Using the same analysis as above, we get a $(1 - \epsilon)$
factor algorithm that runs in $O(n 2^{2^{\lceil 1/\epsilon \rceil}})$ time.
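
The $(1 - \epsilon)$ guarantee follows from the same counting as before. As a sketch, assume there are $N$ sets $A_1, \ldots, A_N$ and that each edge is deleted in exactly one of them; then:

```latex
\[
(N - 1)\, OPT_{CP} \;\le\; \sum_{i=1}^{N} OPT_{CP}(A_i) \;\le\; N \cdot A
\quad\Longrightarrow\quad
A \;\ge\; \Bigl(1 - \frac{1}{N}\Bigr) OPT_{CP} \;\ge\; (1 - \epsilon)\, OPT_{CP},
\]
```

where the last step uses $N = \lceil 1/\epsilon \rceil \ge 1/\epsilon$.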

4 Common Prefix on Stars 

In this section we prove that the Common Prefix problem on stars is equivalent to the problem of
finding large nested neighborhoods in bipartite graphs. We shall use this in the following sections
to prove a hardness of approximation result for Common Prefix. Consider the following
problem.

Definition 1 Nested Neighborhoods: Given a bipartite graph $G = (U, V, E)$ with $U$ and
$V$ as its bipartition and $E$ as its edge set, find subsets $U' \subseteq U$ and $V' \subseteq V$ such that the
elements of $U'$ can be ordered as $u_1, u_2, \ldots, u_{|U'|}$ with
$\Gamma(u_1) \cap V' \supseteq \Gamma(u_2) \cap V' \supseteq \cdots \supseteq \Gamma(u_{|U'|}) \cap V'$,
and such that $|\Gamma(u_1) \cap V'| + |\Gamma(u_2) \cap V'| + \cdots + |\Gamma(u_{|U'|}) \cap V'|$ is maximized.
Here $\Gamma(u)$ denotes the neighborhood of $u$ in $G$.

Note that the above problem is independent of whether we choose the subset from $U$ or from
$V$, since $V'$ can be labeled to get a feasible solution of the same cost. We show that this
problem is equivalent to the Common Prefix problem on stars.

Suppose $G = (U, V, E)$ is an instance of Nested Neighborhoods. Consider a star $T$ with
leaf nodes corresponding to the vertices in $U$ and a vertex $r \notin U$ as the non-leaf vertex. We
treat the vertex set $V$ as a set of labels to be assigned to the vertices of $T$. The vertex $r$ is given
the entire set $V$ as its set of labels, while each of the remaining vertices $u \in U$ is assigned
the label set $\Gamma(u) \subseteq V$. We thus have a Common Prefix instance on $T$. If $u_1, u_2, \ldots, u_{|U'|}$
and $V'$ are feasible for Nested Neighborhoods on $G$, then we can construct a feasible solution
for Common Prefix on $T$, with the same cost, by choosing a permutation of $V$ that has the
labels of $\Gamma(u_{|U'|}) \cap V'$ first, followed by those of $\Gamma(u_{|U'|-1}) \cap V' \setminus \Gamma(u_{|U'|})$, and so on. Thus the
Nested Neighborhoods problem reduces to the Common Prefix problem on stars.
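
The construction can be sketched as follows. The function name, the argument names, and the small instance are hypothetical. The root permutation lists the smallest restricted neighborhood first, then the labels added by each larger one, then the remaining labels of $V$; each leaf lists its labels inside $V'$ in the root's order before its other labels. In this sketch the resulting benefit is at least the Nested Neighborhoods cost, and it can be larger when later labels happen to agree.

```python
def prefix_len(p, q):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


def star_from_nested(gamma, v_all, order, v_prime):
    """gamma: dict u -> set of neighbours in V; order: u_1, ..., u_k; v_prime: V'."""
    restricted = [gamma[u] & v_prime for u in order]
    if any(not (big >= small) for big, small in zip(restricted, restricted[1:])):
        raise ValueError("neighbourhoods are not nested in the given order")
    nn_cost = sum(len(s) for s in restricted)

    # Root permutation: innermost neighbourhood first, then what each larger one adds.
    root_perm = []
    for s in reversed(restricted):
        root_perm += sorted(s - set(root_perm))
    root_perm += sorted(v_all - set(root_perm))

    # Leaf permutations: labels inside V' in root order, then the leaf's other labels.
    leaf_perms = {}
    for u, nbrs in gamma.items():
        inside = [x for x in root_perm if x in nbrs and x in v_prime]
        outside = [x for x in root_perm if x in nbrs and x not in v_prime]
        leaf_perms[u] = inside + outside

    benefit = sum(prefix_len(root_perm, p) for p in leaf_perms.values())
    return nn_cost, root_perm, leaf_perms, benefit


gamma = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "d"}, "u3": {"a"}}
nn_cost, root_perm, leaf_perms, benefit = star_from_nested(
    gamma, {"a", "b", "c", "d"}, ["u1", "u2", "u3"], {"a", "b"})
```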

Conversely, if $T$ is a star in an instance of Common Prefix, with $\Sigma$ as the label set of
the non-leaf vertex $r$ and $S_i$ as the label set of each leaf $u_i$, then we construct a Nested
Neighborhoods instance as follows. The bipartition has the vertex sets $U$, which consists of
all the leaf nodes of $T$, and $V$, which consists of the set of labels $\Sigma$ on $r$. A vertex $u_i \in U$ is
connected by an edge to a vertex $v_s \in V$ if the corresponding label $s \in \Sigma$ belongs to the label
set $S_i$ of $u_i$. Using an argument similar to that in the previous paragraph, it can be shown
that each feasible solution to Common Prefix on $T$ has a corresponding feasible solution to
Nested Neighborhoods on $G$, with the same cost. We thus have the following result.

Theorem 1 The Nested Neighborhoods problem is equivalent to the Common Prefix problem
on an appropriate star.

We note that these are approximation-preserving reductions. From now on, we deal with
the Nested Neighborhoods problem.

5 Edge Bicliques Problem 

Let $G = (U, V, E)$ be a bipartite graph with $U$ and $V$ as its bipartition and $E$ as its set of
edges. If $B$ is a subset of the vertex set $U \cup V$, the subgraph induced by $B$ is said to be a
biclique if $uv \in E$ for all $u \in B \cap U$ and $v \in B \cap V$. The Maximum Edge Biclique (EBCS)
problem asks for a subgraph of a given bipartite graph that is a biclique and has the largest
number of edges.

Lemma 1 Let $G = (U, V, E)$ be a bipartite graph, and let $OPT_{EBCS}$ and $OPT_{NN}$ be the
optimal values of the EBCS and the Nested Neighborhoods problems on $G$. Then
$OPT_{EBCS} \le OPT_{NN}$.

Proof: Suppose that $U' = \{u_1, u_2, \ldots, u_k\} \subseteq U$ and $V' = \{v_1, v_2, \ldots, v_l\} \subseteq V$ induce a biclique.
Since $\Gamma(u_1) \cap V' = \Gamma(u_2) \cap V' = \cdots = \Gamma(u_k) \cap V'$, this corresponds to a feasible solution of
the Nested Neighborhoods problem with the same cost. □

Note that the above proof shows the stronger result that every feasible solution to EBCS
has a corresponding feasible solution to Nested Neighborhoods with at least as much cost.

Lemma 2 Let $G = (U, V, E)$ be a bipartite graph. If it has a feasible solution to Nested
Neighborhoods of cost $c$, then $G$ contains a biclique with at least $\frac{c}{H_n}$ edges, where $H_n$ denotes
the $n$-th harmonic number and $|U| = n$.

Proof: Let $U'$ and $V'$ be a feasible solution to the Nested Neighborhoods problem of cost $c$,
with $\{u_1, u_2, \ldots, u_k\} = U'$ and such that
$\Gamma(u_1) \cap V' \supseteq \Gamma(u_2) \cap V' \supseteq \cdots \supseteq \Gamma(u_k) \cap V'$.
Each vertex subset of the form $u_1, \ldots, u_i$ along with $V' \cap \Gamma(u_i)$ forms a biclique. It is easy
to see that if the largest biclique in the subgraph $P$ induced by $U' \cup V'$ contains $u_i$, then it
also contains all vertices $u_j$ for $j < i$. Let $e$ be the size (number of edges) of the largest biclique in $P$ and let $y_i$
denote $|\Gamma(u_i) \cap V'|$. The biclique induced by $u_1, u_2, \ldots, u_i$ and $V' \cap \bigcap_{j \le i} \Gamma(u_j)$ has $i \times y_i$ edges.
Hence, for each $i = 1, \ldots, k$, $y_i \le e/i$. We now have

\[
\begin{aligned}
c &= y_1 + y_2 + \cdots + y_k \\
&\le \left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{k}\right) e \\
&\le H_n e.
\end{aligned}
\]

This proves the lemma. □ 
Combining Lemma 1 and Lemma 2 we get the following.

\[
OPT_{EBCS} \le OPT_{NN} \le H_n \, OPT_{EBCS}
\]
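
The sandwich inequality can be checked by exhaustive search on a small instance. The sketch below is illustrative only: the instance and function names are hypothetical, and the search is exponential in $|U| + |V|$, so it is meant for toy graphs.

```python
from fractions import Fraction
from itertools import combinations


def subsets(items):
    """All subsets of `items`, as sets."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)


def opt_ebcs(gamma, v_all):
    """Maximum number of edges in a biclique; gamma maps u -> neighbours in V."""
    best = 0
    for vs in subsets(v_all):
        us = [u for u in gamma if vs <= gamma[u]]  # every u adjacent to all of vs
        best = max(best, len(us) * len(vs))
    return best


def opt_nn(gamma, v_all):
    """Maximum Nested Neighborhoods cost by trying every (U', V')."""
    best = 0
    for vs in subsets(v_all):
        restricted = {u: frozenset(gamma[u] & vs) for u in gamma}
        for us in subsets(gamma):
            sets = sorted((restricted[u] for u in us), key=len)
            # A family is a chain iff consecutive sets, sorted by size, are nested.
            if all(a <= b for a, b in zip(sets, sets[1:])):
                best = max(best, sum(len(s) for s in sets))
    return best


def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))


gamma = {"u1": {"a", "b", "c"}, "u2": {"a", "b"}, "u3": {"b", "c"}, "u4": {"d"}}
v_all = {"a", "b", "c", "d"}
ebcs, nn = opt_ebcs(gamma, v_all), opt_nn(gamma, v_all)
```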

There are graphs for which the inequality on the right is tight. Consider the bipartite
graph $G = (U, V, E)$ with $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, v_2, \ldots, v_n\}$ and the edges defined
by the relation $\Gamma(u_i) = \{v_1, v_2, \ldots, v_{n/i}\}$. It is easily seen that every edge occurs in the optimal
solution to Nested Neighborhoods. Thus $OPT_{NN} = n + (n/2) + (n/3) + \cdots + (n/n) = n H_n$.
Further, if $k$ is the largest index of a vertex in $U$ in an optimal solution to EBCS, then every
vertex $u_i$ with $i < k$ is in the optimal solution, so that $OPT_{EBCS} = k(n/k) = n$. Thus
$\frac{OPT_{NN}}{OPT_{EBCS}} = H_n$ for this graph.
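
The tight family can be explored numerically. The sketch below makes one assumption that the text does not state: $n/i$ is rounded down, so the ratio only approaches $H_n$ up to the rounding loss of less than 1. Because all neighborhoods are prefixes of $V$, the optimum for Nested Neighborhoods is the total number of edges, and a best biclique can be taken with a prefix of $V$ on one side.

```python
from fractions import Fraction


def tight_instance(n):
    """Gamma(u_i) = {v_1, ..., v_floor(n/i)}; the floor is an added assumption."""
    return {i: set(range(1, n // i + 1)) for i in range(1, n + 1)}


def opt_nn_chain(gamma):
    # The neighbourhoods already form a chain, so U' = U and V' = V use every edge.
    return sum(len(nbrs) for nbrs in gamma.values())


def opt_ebcs_chain(gamma):
    # For prefix neighbourhoods, try each prefix length as the V-side of the biclique.
    best = 0
    for size in range(1, len(gamma) + 1):
        rows = sum(1 for nbrs in gamma.values() if len(nbrs) >= size)
        best = max(best, rows * size)
    return best


def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))


n = 12
gamma = tight_instance(n)
nn, ebcs = opt_nn_chain(gamma), opt_ebcs_chain(gamma)
ratio = Fraction(nn, ebcs)  # compare with harmonic(n)
```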

6 Hardness of Common Prefix on Stars



We will need the following result of Feige and Kogan.

Theorem 2 (Feige-Kogan [2]) If the maximum edge biclique problem can be approximated within
a factor of $2^{(\log n)^{\delta}}$ for every constant $\delta > 0$, then 3-SAT can be solved in time
$2^{n^{3/4+\epsilon}}$ for every constant $\epsilon > 0$.

Suppose that there is an algorithm that approximates Nested Neighborhoods (equivalently, by
Theorem 1, Common Prefix on stars) within a factor of $\alpha$, i.e. if it returns the value $A$, then
$\alpha \, OPT_{NN} \le A \le OPT_{NN}$. Then using Lemma 2, we know that the bipartite graph contains a
feasible solution to EBCS of size $A'$ such that $A \le H_n A'$. We then get an $\alpha / H_n$ factor
algorithm for EBCS, since

\[
\begin{aligned}
A' &\ge \frac{A}{H_n} \\
&\ge \frac{\alpha}{H_n} OPT_{NN} \\
&\ge \frac{\alpha}{H_n} OPT_{EBCS}.
\end{aligned}
\]



Thus, using Theorems 1 and 2, we get the following hardness result.



Theorem 3 If the Common Prefix problem for stars can be approximated within a factor of
$2^{(\log n)^{\delta} - \log\log n}$ for every constant $\delta > 0$, then 3-SAT can be solved in time
$2^{n^{3/4+\epsilon}}$ for every constant $\epsilon > 0$.
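
The exponent in Theorem 3 can be related to the loss of $H_n$ as follows. This is a sketch under conventions that are not stated explicitly above: logarithms are to base 2, approximation factors are read as ratios of at least 1, and the same size parameter $n$ is used in both theorems.

```latex
\[
2^{(\log n)^{\delta} - \log\log n} \cdot H_n
  \;=\; 2^{(\log n)^{\delta}} \cdot \frac{H_n}{\log n}
  \;\le\; 2^{(\log n)^{\delta}}
  \qquad \text{for } n \ge 10,
\]
```

since $H_n \le 1 + \ln n \le \log_2 n$ once $n \ge 10$. Under these conventions, an algorithm for Common Prefix on stars with the ratio of Theorem 3 would yield, through Lemma 2, an algorithm for EBCS with ratio at most $2^{(\log n)^{\delta}}$, which is the hypothesis of Theorem 2.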



7 Acknowledgments 

We thank Ravindra Guravannavar for posing this problem. 






References 



[1] Ravindra Guravannavar, S. Sudarshan, Ajit A. Diwan, Ch. Sobhan Babu, Reducing Order
Enforcement Cost in Complex Query Plans. Manuscript, November 2006. Available at
http://arxiv.org/abs/cs.DB/0611094

[2] Uriel Feige, Shimon Kogan, Hardness of Approximation of the Balanced
Complete Bipartite Subgraph Problem. Manuscript, May 2004. Available at
http://research.microsoft.com/theory/feige/






